Reconstructing firm-level interactions in the Dutch input–output network from production constraints

Recent crises have shown that knowledge of the structure of input–output networks at the firm level is crucial when studying economic resilience from the microscopic point of view of firms that try to rewire their connections under supply and demand constraints. Unfortunately, empirical inter-firm network data are protected by confidentiality and hence rarely accessible. The available methods for network reconstruction from partial information treat all pairs of nodes as potentially interacting, thereby overestimating the rewiring capabilities of the system and the implied resilience. Here, we use two big data sets of transactions in the Netherlands to represent a large portion of the Dutch inter-firm network and document its properties. We then introduce a generalized maximum-entropy reconstruction method that preserves the production function of each firm in the data, i.e. the input and output flows of each node for each product type. We confirm that the new method becomes increasingly reliable in reconstructing the empirical network as a finer product resolution is considered and can therefore be used as a realistic generative model of inter-firm networks with fine production constraints. Moreover, the likelihood of the model directly enumerates the number of alternative network configurations that leave each firm in its current production state, thereby estimating the reduction in the rewiring capability of the system implied by the observed input–output constraints.

We can now reformulate the two steps of the scGM for sampling edges and their weights: the probability of observing a link of type $\gamma$ between nodes $i$ and $j$ is given by
$$p_{ij}^{\gamma} = \frac{z\, s_i^{\mathrm{out},\gamma} s_j^{\mathrm{in},\gamma}}{1 + z\, s_i^{\mathrm{out},\gamma} s_j^{\mathrm{in},\gamma}},$$
while the weight of a sampled link is assigned as
$$w_{ij}^{\gamma} = \frac{s_i^{\mathrm{out},\gamma} s_j^{\mathrm{in},\gamma}}{w_{\mathrm{tot}}^{\gamma}\, p_{ij}^{\gamma}}\, a_{ij}^{\gamma},$$
where $w_{\mathrm{tot}}^{\gamma} = \sum_i \sum_{j(\neq i)} w_{ij}^{\gamma}$. This model can be fitted to data either by using the total density of connections to find $z$ or by letting the density depend on the product layer. In the latter case $z$ is replaced by $z_{\gamma}$, which is found by imposing
$$\sum_i \sum_{j(\neq i)} p_{ij}^{\gamma} = L_{\gamma}^*,$$
with $L_{\gamma}^*$ the observed number of links in layer $\gamma$. This model might appear different from the scGM, given that both the out- and in-strengths are now product-wise. However, under the assumption that each firm produces only one good, the same for every business in a given industry, the generalization above and the model presented in the main text coincide. Under this restriction the product $\gamma$ coincides with the industry $g_i$ producing it, giving $\gamma = g_i$. It then follows that $s_i^{\mathrm{out},\gamma} = s_i^{\mathrm{out}}$ for $\gamma = g_i$ and zero otherwise and, similarly, $s_j^{\mathrm{in},\gamma} = s_{g_i \to j}$ for $\gamma = g_i$. The generalization above also clarifies that input and output strengths are not per se treated differently; it is rather the choice of the proxy used to identify goods (in our case, the sector classification) that causes only one out-strength to be different from zero.
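As a concrete illustration of this two-step recipe, the sketch below samples one product layer of the generalized scGM with numpy. It is a simplified sketch, not the original code; the function and variable names (sample_scgm_layer, s_out, s_in) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def sample_scgm_layer(s_out, s_in, z):
    """Sample one product layer gamma of the generalized scGM.

    s_out, s_in : product-specific out-/in-strengths of the n firms (1-d arrays)
    z           : density parameter (global z or layer-specific z_gamma)
    Returns the weighted adjacency matrix of the sampled layer.
    """
    s_out = np.asarray(s_out, dtype=float)
    s_in = np.asarray(s_in, dtype=float)
    n = s_out.size
    w_tot = s_out.sum()                       # total weight circulating in the layer
    x = np.outer(s_out, s_in)                 # s_i^{out,gamma} * s_j^{in,gamma}
    p = z * x / (1.0 + z * x)                 # step 1: link probabilities p_ij^gamma
    np.fill_diagonal(p, 0.0)                  # no self-loops
    links = rng.random((n, n)) < p            # sample the binary adjacency
    w = np.zeros((n, n))
    w[links] = x[links] / (w_tot * p[links])  # step 2: weights preserving strengths on average
    return w

# Example with illustrative numbers: three firms selling the product, others only buying it.
# w = sample_scgm_layer(s_out=[3e5, 1e4, 0.0, 2e3], s_in=[0.0, 5e4, 2e5, 6e4], z=1e-9)
```

By construction, averaging the sampled weights over the ensemble reproduces the product-wise strengths of every firm.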

S.2 Entropy and likelihood in Exponential Random Graphs
Both the dcGM and the scGM derive from the general family of constrained, maximum-entropy graph ensembles. We recap here some of the important properties of this group of models.
Maximum-entropy graph ensembles are a result of the approach proposed by Jaynes of maximizing Shannon entropy subject to constraints 1. This can be achieved by maximizing the functional
$$\mathscr{F}[P] = S[P] + \mu\Big(1 - \sum_{G\in\mathcal{G}} P(G)\Big) + \sum_{m=1}^{M}\lambda_m\Big(C_m^* - \sum_{G\in\mathcal{G}} P(G)\, C_m[G]\Big),$$
where $S[P]$ is the Shannon entropy, defined by $S = -\sum_{G\in\mathcal{G}} P(G)\ln P(G)$, and $\{C_m[G]\}_{m=1}^{M}$ is the set of constraints encoding known information about the ensemble. Solving the maximization problem yields the general form of Exponential Random Graphs (ERG) 2:
$$P(G\,|\,\vec{\lambda}) = \frac{e^{-\sum_{m=1}^{M}\lambda_m C_m[G]}}{Z(\vec{\lambda})},$$
where $Z(\vec{\lambda})$ is the partition function, defined as $Z(\vec{\lambda}) = \sum_{G\in\mathcal{G}} e^{-\sum_{m=1}^{M}\lambda_m C_m[G]}$. Given a real graph $G^*$, we would like to find the set of Lagrange multipliers $\{\lambda_m\}_{m=1}^{M}$ that maximize the log-likelihood
$$\mathscr{L}(G^*\,|\,\vec{\lambda}) = \ln P(G^*\,|\,\vec{\lambda}) = -\sum_{m=1}^{M}\lambda_m C_m[G^*] - \ln Z(\vec{\lambda});$$
it can be shown that the maximum of the log-likelihood is found for the values $\vec{\lambda}^*$ ensuring that the expected value of the constraints matches their empirical value, i.e. $\langle C_m\rangle = C_m[G^*] = C_m^*$. Since the entropy of ERGs is given by
$$S(\vec{\lambda}) = \sum_{m=1}^{M}\lambda_m\langle C_m\rangle + \ln Z(\vec{\lambda}),$$
if we now insert the values $\vec{\alpha}^*, \vec{\beta}^*$ solving Eqs. (16) and (17), we get the exact relationship between entropy and minus log-likelihood as in Eq. (13):
$$S(\vec{\alpha}^*, \vec{\beta}^*) = -\mathscr{L}(G^*\,|\,\vec{\alpha}^*, \vec{\beta}^*).$$
If we insert instead the values $\vec{\tilde{\alpha}}, \vec{\tilde{\beta}}$ implied by the fitness ansatz, we obtain
$$S(\vec{\tilde{\alpha}}, \vec{\tilde{\beta}}) \simeq -\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}}),$$
with equality holding when the expected degrees match the empirical ones; here $\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}})$ is exactly the log-likelihood of the dcGM model, while $S(\vec{\tilde{\alpha}}, \vec{\tilde{\beta}})$ is the entropy of a DCM with degrees $\langle\vec{k}^{\mathrm{out}}\rangle_{\mathrm{dcGM}}$ and $\langle\vec{k}^{\mathrm{in}}\rangle_{\mathrm{dcGM}}$, i.e. the maximum-entropy ensemble with degrees given by the fitness ansatz in Eqs. (20) and (21).
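The following minimal numpy/scipy sketch shows how the single density parameter of the dcGM can be fitted by matching the expected number of links to the observed one, and how the resulting minus log-likelihood, which approximates the ensemble entropy when the fitness ansatz holds, is computed. Function and variable names are illustrative; this is a simplified sketch, not the authors' code.

```python
import numpy as np
from scipy.optimize import brentq

def fit_z_dcgm(s_out, s_in, n_links):
    """Find z such that the expected number of links under the dcGM matches the observed one."""
    x = np.outer(np.asarray(s_out, float), np.asarray(s_in, float))
    np.fill_diagonal(x, 0.0)
    def gap(log_z):
        z = np.exp(log_z)
        return (z * x / (1.0 + z * x)).sum() - n_links
    return np.exp(brentq(gap, -60.0, 60.0))

def minus_log_likelihood_dcgm(adj, s_out, s_in, z):
    """-log L(G*| alpha~, beta~): Bernoulli log-likelihood of the observed binary adjacency."""
    adj = np.asarray(adj, dtype=float)
    x = np.outer(np.asarray(s_out, float), np.asarray(s_in, float))
    np.fill_diagonal(x, 0.0)
    p = z * x / (1.0 + z * x)
    eps = 1e-300                                        # guard against log(0)
    terms = adj * np.log(p + eps) + (1.0 - adj) * np.log(1.0 - p + eps)
    np.fill_diagonal(terms, 0.0)                        # self-loops excluded
    return -terms.sum()
```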
This relationship holds for the scGM as well; however, the fitness ansatz is no longer defined at the degree level but at the degree-per-sector level. A clearer way to picture this is to consider the multi-layer description discussed in section S.1: there, the fitness ansatz applies to each layer independently. To simplify the notation, let $\Gamma$ be the set of all possible industries (or products) and $\gamma \in \Gamma$ denote a specific layer containing only the edges that come from a node of that sector. We can then write equivalently $s_{g_i \to j}$ or $s_j^{\mathrm{in},\gamma}$ given that $g_i = \gamma$. To avoid confusion we note, as in section S.1, that we can also write equivalently $s_i^{\mathrm{out}}$ or $s_i^{\mathrm{out},\gamma}$ if $g_i = \gamma$, given the way layers are constructed. Then the expected degrees in the scGM are given by:
$$\langle k_i^{\mathrm{out},\gamma}\rangle_{\mathrm{scGM}} = \sum_{j(\neq i)} p_{ij}^{\gamma}, \qquad \langle k_i^{\mathrm{in},\gamma}\rangle_{\mathrm{scGM}} = \sum_{j(\neq i)} p_{ji}^{\gamma}.$$
From the equations above we can clearly see that this is equivalent to applying the fitness ansatz on each layer independently and, as such, we must define the DCM on each layer as well. In this case, each layer has two vectors of parameters $\vec{\alpha}_{\gamma}, \vec{\beta}_{\gamma}$ that in the DCM are fitted to ensure that $\langle k_i^{\mathrm{out},\gamma}\rangle$ and $\langle k_i^{\mathrm{in},\gamma}\rangle$ equal the empirical values. For the scGM we can define a new set of values $\vec{\tilde{\alpha}}_{\gamma}, \vec{\tilde{\beta}}_{\gamma}$ given by
$$\tilde{\alpha}_{i,\gamma} = -\ln\!\big(\sqrt{z}\, s_i^{\mathrm{out},\gamma}\big), \qquad \tilde{\beta}_{j,\gamma} = -\ln\!\big(\sqrt{z}\, s_j^{\mathrm{in},\gamma}\big).$$
Upon substituting these $\vec{\tilde{\alpha}}_{\gamma}, \vec{\tilde{\beta}}_{\gamma}$ values into (22), we obtain the entropy of each layer. Since the layers do not interact, the total entropy is just the sum of each layer's entropy:
$$S(\vec{\tilde{\alpha}}, \vec{\tilde{\beta}}) = \sum_{\gamma\in\Gamma} S(\vec{\tilde{\alpha}}_{\gamma}, \vec{\tilde{\beta}}_{\gamma}) \simeq -\sum_{\gamma\in\Gamma} \mathscr{L}_{\gamma}(G^*\,|\,\vec{\tilde{\alpha}}_{\gamma}, \vec{\tilde{\beta}}_{\gamma}) = -\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}});$$
just like in the case of the dcGM, $\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}})$ is the log-likelihood of the fitted scGM model and $S(\vec{\tilde{\alpha}}_{\gamma}, \vec{\tilde{\beta}}_{\gamma})$ is the entropy of the DCM model applied to layer $\gamma$ with expected degrees $\langle k_i^{\mathrm{in},\gamma}\rangle_{\mathrm{scGM}}$ and $\langle k_i^{\mathrm{out},\gamma}\rangle_{\mathrm{scGM}}$.
It is, therefore, true that in both models the negative log-likelihood $-\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}})$ is still a measure of the number of alternative network configurations having those degrees as expected values. If Eqs. (20) and (21), or (25) and (26), are well obeyed by the data, then $-\mathscr{L}(G^*\,|\,\vec{\tilde{\alpha}}, \vec{\tilde{\beta}})$ is also a good approximation of the original entropy $S(\vec{\alpha}^*, \vec{\beta}^*)$, i.e. of the number of network configurations having the empirical degrees as expected values.
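The layer decomposition can be made explicit with a short sketch: the (binary) entropy of the scGM ensemble is the sum, over non-interacting product layers, of independent Bernoulli link entropies. As before, the data layout and names are illustrative assumptions.

```python
import numpy as np

def layer_entropy(s_out_g, s_in_g, z):
    """Shannon entropy of one scGM layer: links are independent Bernoulli variables."""
    x = z * np.outer(np.asarray(s_out_g, float), np.asarray(s_in_g, float))
    p = x / (1.0 + x)
    np.fill_diagonal(p, 0.0)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return np.nansum(h)                 # entries with p in {0, 1} contribute zero entropy

def total_entropy(strengths_by_layer, z):
    """Total entropy as the sum over the non-interacting product layers."""
    return sum(layer_entropy(s_out_g, s_in_g, z)
               for s_out_g, s_in_g in strengths_by_layer)
```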

S.3 Relationship with block models
Let us now explore the relationship between the scGM and block models, specifically the Block Configuration Model (BCM) 6. To derive the BCM, let us start from the definition of the contribution of each group to the nodes' degrees. We can express these contributions as
$$k_i^{\mathrm{out} \to g} = \sum_{j(\neq i)} a_{ij}\, \delta_{g_j, g}, \qquad k_j^{\mathrm{in} \leftarrow g} = \sum_{i(\neq j)} a_{ij}\, \delta_{g_i, g};$$
in line with the discussion in section S.2, it is possible to define a Hamiltonian accounting for the quantities above, which results in the probability of a link existing from node $i$ in group $g_i$ to node $j$ in group $g_j$ reading
$$p_{ij} = \frac{e^{-\alpha_i^{g_i \to g_j} - \beta_j^{g_i \to g_j}}}{1 + e^{-\alpha_i^{g_i \to g_j} - \beta_j^{g_i \to g_j}}}.$$
The multipliers $\alpha_i^{g_i \to g_j}$ and $\beta_j^{g_i \to g_j}$ can then be found by imposing that the expected value of the constrained quantities equals their empirical value, that is:
$$\langle k_i^{\mathrm{out} \to g}\rangle = k_i^{\mathrm{out} \to g}(G^*), \qquad \langle k_j^{\mathrm{in} \leftarrow g}\rangle = k_j^{\mathrm{in} \leftarrow g}(G^*) \qquad \forall\, i, j, g.$$
We can clearly see that if we constrain the values of the multipliers of the BCM with the following fitness ansatz
$$e^{-\alpha_i^{g_i \to g_j}} = \sqrt{z}\, s_i^{\mathrm{out}}, \qquad e^{-\beta_j^{g_i \to g_j}} = \sqrt{z}\, s_{g_i \to j},$$
we recover the $p_{ij}$ equation of the scGM. It is, therefore, possible to argue that the scGM is derived from the BCM in the same sense in which it is derived from the DCM; however, this is true only for the specific case of single-product firms, which should be taken as a consequence of the available data rather than of the approach. Our methodology, as discussed in section S.1, can easily be generalized to the multi-product case, which requires us to consider a multi-layer network. We therefore consider the scGM a fitness-induced multi-layer DCM rather than a special case of the BCM.
Another approach in the literature that bears a strong resemblance to ours is the use of a fitness model (dcGM) with a varying density that depends on the groups 7. It modifies the dcGM link probability as follows:
$$p_{ij} = \frac{z_{g_i g_j}\, s_i^{\mathrm{out}} s_j^{\mathrm{in}}}{1 + z_{g_i g_j}\, s_i^{\mathrm{out}} s_j^{\mathrm{in}}},$$
where the $z_{g_i g_j}$ parameter is tuned to the desired density of connections between the two groups $g_i$ and $g_j$. In our context, this would be achieved by setting $z_{g_r g_s}$ to satisfy
$$\sum_{i}\sum_{j(\neq i)} \delta_{g_i, g_r}\, \delta_{g_j, g_s}\, \langle a_{ij}\rangle = L^*_{g_r g_s} \qquad \forall\, g_r, g_s.$$
This approach differs from the scGM in three important ways. The first concerns the link densities to be preserved: the scGM only considers $L_{g_i}$ and not $L_{g_i g_j}$; this implies that, while the block-wise dcGM ensures that the probability of connection depends on the number of links between the two groups, the scGM considers only the density of outgoing links from $g_i$, independently of the group the edges are going to. This difference is, however, not at the core of the scGM methodology, as it could be removed by letting $z$ depend on both groups in the scGM as well.
A second, minor difference concerns weights: the block-wise dcGM weights are assigned with the deterministic recipe $w_{ij} = \frac{s_i^{\mathrm{out}} s_j^{\mathrm{in}}}{w^{\mathrm{tot}}\, p_{ij}}\, a_{ij}$. This creates a strange paradox: given a node in group $g_i$ and two nodes in groups $g_j$ and $g_k$ with equal in-strength $s_j^{\mathrm{in}} = s_k^{\mathrm{in}}$, the weight of the sampled edge will be higher for the node pair with the lower probability of connection. This is not a desirable property, as it runs contrary to the fitness ansatz at the core of the methodology. The issue is resolved in the scGM by letting the weight depend on the strengths by sector rather than on the total ones. This points to the more fundamental distinction between the two approaches: the block-wise dcGM does not consider the heterogeneity that might exist between firms in the same group. Indeed, since $p_{ij}$ depends only on $z_{g_i g_j}$ and the total strengths, there is no information on the production technology of each individual firm in terms of the product composition of its inputs. As such, in the block-wise dcGM a link between firms $i$ and $j$ can exist even if firm $j$ does not have any input from sector $g_i$, so long as there exists a firm in $g_j$ that does. For these reasons we consider the block-wise dcGM unsuitable for analysing firm networks, as it lacks some of the fundamental information necessary to correctly reproduce the true structure of production.
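The weight paradox can be illustrated with a few lines of arithmetic on made-up numbers (the values below are purely illustrative and not taken from the data):

```python
# One seller with two potential buyers that have the same total in-strength
# but sit in blocks of different density z (block-wise dcGM).
s_out_i = 1.0e3      # out-strength of the seller
s_in = 1.0e3         # in-strength of both buyers
w_tot = 1.0e6        # total weight of the network

for label, z in [("buyer in dense block", 1e-6), ("buyer in sparse block", 1e-8)]:
    x = z * s_out_i * s_in
    p = x / (1.0 + x)                    # block-wise dcGM link probability
    w = s_out_i * s_in / (w_tot * p)     # deterministic weight assigned if the link is sampled
    print(f"{label}: p = {p:.4f}, weight if sampled = {w:.2f}")

# dense block:  p = 0.5000, weight if sampled ~ 2
# sparse block: p = 0.0099, weight if sampled ~ 101
# The pair with the LOWER connection probability receives the LARGER weight.
```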

S.4 Fitness ansatz
Both the dcGM and the scGM rely on the simple hypothesis that strengths and degrees are positively correlated: if this were not the case for our datasets, both models would perform poorly. We qualitatively assess the validity of this hypothesis by checking how well this correlation holds in the empirical data at the various levels of sector definition. In figure S.1 (and figure S.8) we plot the relationship between out-strengths and out-degrees, as well as between in-strengths and in-degrees, by sector, for an increasingly specific sector definition. We find that the fitness ansatz is well supported by the data in terms of trend and average, with both models able to correctly capture the trend at both the aggregate and the sector-specific level. Notice, however, that the spread of data points around the mean is not reproduced by either model, the predicted variance of the degree being less than, or of the same order of magnitude as, its expected value; this discrepancy is, again, due to the fitness ansatz, which leads our models to predict degrees that are, by construction, proportional to the strengths, hence the expected monotonic trend.

In each panel we show the degree/strength relation for the specified product layer. Note that edges exist on a layer if the node they come from belongs to the given sector. The coloured area represents the density of points in the empirical network for each bin of a two-dimensional logarithmic binning. The purple rhombuses give the average empirical strengths computed over a logarithmic binning of the degree. The red dashed line shows the theoretical relationship between expected strength and degree in the fitted scGM. The red shaded area is computed by taking one hundred samples from the scGM ensemble, computing the 5th and 95th percentile of the degree for each node, and then smoothing the resulting points with a moving average over increasing values of the strength. The area therefore represents the interval in which 90% of sampled degree values will lie for a node of given strength.
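A rough quantitative counterpart to this visual check could be the rank correlation between the empirical degrees and the degrees implied by the strengths alone. The sketch below is one way to compute it under assumed inputs (strength arrays and empirical degree vectors); it is not the binned comparison shown in the figure, and the names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def expected_out_degrees(s_out, s_in, z):
    """Expected out-degrees implied by the fitness ansatz p_ij = z s_i s_j / (1 + z s_i s_j)."""
    x = z * np.outer(np.asarray(s_out, float), np.asarray(s_in, float))
    p = x / (1.0 + x)
    np.fill_diagonal(p, 0.0)
    return p.sum(axis=1)

def fitness_ansatz_check(s_out, s_in, k_out_empirical, z):
    """Rank correlation between empirical out-degrees and degrees predicted from strengths alone."""
    k_pred = expected_out_degrees(s_out, s_in, z)
    rho, pval = spearmanr(k_out_empirical, k_pred)
    return rho, pval
```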

S.5 Structural properties
In the main text we discussed how the dcGM is not able to accurately reproduce the trend of the average nearest neighbour strength. We can observe this phenomenon also via a direct comparison between the expected average nearest neighbour strengths and the empirical ones. If the ensembles had captured the correct average size of the companies in the neighbourhood of each node, then all points in figure S.2 would cluster around the red dashed line. However, both right-hand plots in figure S.2 show that the majority of nodes are assigned a high value of the average nearest neighbour strength, because all nodes are more likely to connect to the nodes with the highest total strength. By introducing restrictions by sector, the scGM still preserves the same trend, which is intrinsic to the fitness ansatz, but only within the correct sector. This allows the scGM to predict the neighbours of the nodes and, hence, their properties more accurately. The difference between the dcGM and the scGM is even more evident in figure S.3, where we have repeated the same comparison per node but splitting each node's in-strength into its sector components. This allows us to visualize the error in reconstructing the in-strength by sector of the in-neighbours of each node. It is evident that the scGM performs significantly better than the dcGM, as the points are more closely packed around the 45-degree line.
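For reference, the quantity being compared can be computed directly from a weighted adjacency matrix. The sketch below assumes a dense numpy matrix W with W[i, j] the euro flow from firm i to firm j; this is a simplified illustration, not the code used for the figures.

```python
import numpy as np

def average_in_neighbour_in_strength(W):
    """Average total in-strength of the in-neighbours of every node.

    W[i, j] is the yearly euro flow from firm i to firm j, so the in-neighbours
    of node j are the rows i with W[i, j] > 0.
    """
    W = np.asarray(W, dtype=float)
    A = (W > 0).astype(float)
    s_in = W.sum(axis=0)                         # total in-strength of each node
    k_in = A.sum(axis=0)                         # in-degree of each node
    with np.errstate(invalid="ignore", divide="ignore"):
        anns = (A.T @ s_in) / k_in               # mean in-strength over the in-neighbours
    return anns                                  # NaN for nodes with no in-neighbours
```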

S.6 Identifying firms
The available data consist of all incoming and outgoing transactions on the accounts of the Dutch clients of the banks. As we are concerned only with inter-firm transactions, we exclude from the analysis any account belonging to a private individual. For the ING data, it is known directly from the account information whether an account belongs to a private individual or to a commercial client. For the ABN AMRO dataset this is achieved using the Business Contact Database (BCDB) curated by the Commercial Banking (CB) department. In this database each firm is associated with a hierarchical definition of the commercial complex it belongs to. This allows us both to remove private individuals, which do not have an assigned commercial complex, and to reconstruct the often complex hierarchical relationships that exist between accounts of the same firm. Note that there is a difference between the two approaches: the ING data is more inclusive, guaranteeing maximum coverage of all commercial clients; the ABN approach, on the other hand, is quite restrictive: any node included will definitely be a firm, but smaller firms might not be included if they have not been correctly labelled as complexes. The commercial complex definition comes from the account manager of the commercial client, who, together with the client, determines which accounts belong to the same firm and which of the available contacts is the main account holder. A similar structured grouping of accounts is achieved in the ING data using the legal (ultimate) parent information available in the ING transaction data. Each commercial complex is a single node in our network; all transactions within the group are discarded, while transactions between groups are summed irrespective of the specific account on which the transfer was recorded.

Identifying firms is a delicate but fundamental step. The transaction data is per se already a network, as it encodes flows of money from one account to another. However, this data cannot be used as is, because it would not be an accurate description of the input-output relationships between firms. This is because we are interested in constructing an input-output network where each node is a functional economic unit, such that any incoming money into the account can be interpreted as the remuneration for a product or service of a known type. For this to hold, two things must be true: first, only transactions that correspond to final consumption or intermediate use of products should be included; second, the accounts must be aggregated such that all expenses and income relating to production are assigned to the same node. The importance of the second point can be appreciated by imagining a firm with multiple accounts, one used to pay employees and taxes and to receive sales income, and one used to pay suppliers. If we considered these accounts as independent nodes, we would break the functional economic unit, and its single balance sheet, into two meaningless units, and we would treat transfers between the two accounts as the sale of some product. We could, of course, exaggerate in the other direction, that is, by aggregating more than necessary. Consider the case of a financial holding that owns multiple businesses: each business has its own accounts and balance sheet and should be treated as an independent node; if we were to consider all firms owned by the holding as a single node, we would be ignoring the fact that these balance sheets operate independently and that their production processes are not the same.
The commercial complex definition used in our analysis has been developed with a similar objective in mind: that is to be able to analyse the total operating income and expenses of a firm as a whole.
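As an illustration of the aggregation step just described, the following pandas sketch maps accounts to commercial complexes, drops intra-complex transfers and sums the remaining flows between firms. The column names (payer_account, payee_account, amount) and the account_to_complex mapping are illustrative assumptions, not the banks' actual schema.

```python
import pandas as pd

def aggregate_to_firms(tx: pd.DataFrame, account_to_complex: dict) -> pd.DataFrame:
    """Collapse account-level transactions into firm-level (commercial-complex) flows.

    tx                 : one row per transaction, columns 'payer_account', 'payee_account', 'amount'
    account_to_complex : mapping from account id to commercial-complex (firm) id
    """
    tx = tx.copy()
    tx["payer_firm"] = tx["payer_account"].map(account_to_complex)
    tx["payee_firm"] = tx["payee_account"].map(account_to_complex)
    # drop accounts without a complex (private individuals or unidentified counterparties)
    tx = tx.dropna(subset=["payer_firm", "payee_firm"])
    # discard intra-complex transfers: they are not sales of a product or service
    tx = tx[tx["payer_firm"] != tx["payee_firm"]]
    # sum all flows between the same pair of firms, irrespective of the accounts used
    return tx.groupby(["payer_firm", "payee_firm"], as_index=False)["amount"].sum()
```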
A final step in the identification of a firm is the correct assignment of an industry label. For the purpose of this analysis we use the Dutch Standaard Bedrijfsindeling (SBI) hierarchical five-digit codes for the ABN AMRO data and the North American Industry Classification System (NAICS) hierarchical six-digit codes for the ING data. Note that the SBI codes are well aligned with the Statistical Classification of Economic Activities in the European Community (NACE) codes, as they share the same first four digits, and with the International Standard Industrial Classification of All Economic Activities (ISIC), as the first two digits are the same. All Dutch firms are assigned an SBI code when they register at the Chamber of Commerce (Kamer van Koophandel, KvK); however, given the complex legal structures that companies may adopt, it is not always clear which of the different codes associated with a commercial complex is the main one. To address this, at ABN AMRO each commercial complex has internally been assigned a 'real' SBI code, which the account manager believes best captures the actual economic activity of the firm. This curated information is particularly useful because firms often have a legal structure with a financial holding at its apex, which would otherwise lead to an over-representation of the financial sector. The reason for using the NAICS codes for ING is that a similar manual overwrite of the sector code is performed at ING, but only on the NAICS codes. We have chosen to give precedence to this better-quality data over coherence between the two datasets, as the analysis does not rest on the codes being the same for both networks and the two classifications are quite similar.

S.7 Filtering transactions
The bulk of the transactions available in the two datasets consists of SEPA payments. Both datasets, however, also contain SWIFT and other international and internal transfers related, among other things, to the repayment of loans, credit due to financial investments or the deposit of a loan, as well as cash and card payments. As discussed above, we want to keep only the payments relating to the sales of the companies; as such, we need to eliminate all transactions related to investments and credit, as well as all internal transfers between accounts. We do so by considering all transactions between firms to be related to an exchange of a product or service and by eliminating all firms from the financial and insurance sector (SBI codes starting with 64, 65 or 66, and NAICS codes starting with 52) and the public administration (SBI codes starting with 84, and NAICS codes starting with 92). The remaining transactions relate either to final consumption or to other credit and debit relationships; for the purpose of this analysis, however, it is sufficient to use only the firm-to-firm connections and ignore household consumption.
Another important aspect of the data is the temporal dimension. The available data is extremely granular; however, for the purpose of understanding the structure of the relationships between firms, we would like a more static view. We therefore chose to aggregate transactions on a yearly basis, so that a link from node A to node B represents the total flow of goods produced by firm A and bought by firm B in a given year, measured in euros. A year is chosen because it reduces any effect that seasonality may have, giving us a balanced representation of all firms. Indeed, if we chose a monthly or quarterly aggregation, those sectors that see most of their activity in a given period of the year would be under-represented in certain snapshots.
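A minimal pandas sketch of the sector filtering and yearly aggregation just described is given below. The column names (payer_firm, payee_firm, amount, date), the firm_sbi mapping and the helper name are illustrative assumptions rather than the actual pipeline; only the SBI filter is shown, the NAICS case being analogous.

```python
import pandas as pd

EXCLUDED_SBI_2DIGIT = {"64", "65", "66", "84"}   # finance, insurance, public administration

def build_yearly_network(tx: pd.DataFrame, firm_sbi: dict) -> pd.DataFrame:
    """Drop excluded sectors and aggregate the remaining firm-to-firm flows per year.

    tx       : columns 'payer_firm', 'payee_firm', 'amount', 'date'
    firm_sbi : mapping from firm id to its (curated) SBI code as a string
    """
    payer_sbi2 = tx["payer_firm"].map(firm_sbi).astype(str).str[:2]
    payee_sbi2 = tx["payee_firm"].map(firm_sbi).astype(str).str[:2]
    keep = ~(payer_sbi2.isin(EXCLUDED_SBI_2DIGIT) | payee_sbi2.isin(EXCLUDED_SBI_2DIGIT))
    tx = tx[keep].copy()
    # yearly aggregation: one weighted edge per (seller, buyer, year), measured in euros
    tx["year"] = pd.to_datetime(tx["date"]).dt.year
    return tx.groupby(["year", "payer_firm", "payee_firm"], as_index=False)["amount"].sum()
```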

S.8 Incomplete information
It should be noted that, unfortunately, we cannot claim to have complete coverage of all inter-client transactions. There are two main limitations to our data: mediated transfers and unobserved direct flows. The first case is possibly the more problematic, as there exist many payment service providers as well as other systems that mediate between firms. In particular, there exist various services offered by financial institutions to pay all credit invoices, or parts of them, upfront, taking upon themselves the risk of the customer not paying or paying late. These kinds of transactions are precisely what we would like to include in our analysis. We know that these transfers exist in our data and we can in some cases identify one side of the transaction on the accounts; however, linking sender and receiver is not easy. Further work could be dedicated to better understanding this issue and how to solve it, possibly by developing an algorithm to find matching pairs of unconnected flows. The second issue relates to the heterogeneity in the payment systems available to the clients. Other than SEPA transfers, we have batch payments, cash deposits and withdrawals, card payments and credit, international transfers, and many others. In the case of cash it is of course impossible to determine the origin of the payment, but in many other cases as well it is not possible to clearly identify the counterparty to a transaction in our datasets.
Furthermore, limiting ourselves to inter-client transactions introduces a bias in our analysis that varies firm by firm. The proportion of observed to unobserved links is determined by how connected the client is to firms that are not part of the bank's network and to businesses abroad. Unfortunately, we are currently not able to determine whether those accounts belong to a private individual or a firm, nor to group the accounts as discussed in section S.6. We are also not able to correctly assign an industrial classification code to those firms. How this affects our analysis is not easy to determine. The quality of our reconstruction depends fundamentally on the functional relation between the fitness and the probability of a link existing. In our case, where the fitness is the strength of the node, we can see two possibilities for a given firm: either we observe a high strength but a lower-than-expected degree, because few counterparties of this firm are in the network but the weights of the transactions that are observed represent the majority of the "true" strength of the firm; or we observe a low strength, because the only transactions in our network are of low value, and hence the strength is not representative of the real probability of observing a link. Both cases are possible because we expect the weight of each link to be proportional to the size of both companies involved. Given the power-law nature of firm sizes, we expect many transactions of low value and a few that represent the majority of the strength. The second case in particular could result in fully disconnected nodes in our reconstructed samples and very low expected degrees.

S.9 Descriptive statistics
The network constructed with the methodology described so far is composed of one giant connected component containing 99.56% of nodes for the ABN data and 99.998% for the ING data. Removing the financial and government sectors does not significantly affect the number of components, with the largest component still retaining 99.54% of all nodes for the ABN data and 99.01% for the ING data. In both cases the other nodes are mostly in dyadic relations, with a maximum component size of 27 nodes. In the table below we summarize the sector compositions of the two datasets. Note that some of the discrepancies are due to the difficulty in precisely translating between the two classification standards.

The median sample line is computed by taking 100 samples from the scGM ensemble with all sectors, computing the median strength for each node, and then taking the distribution of these median values. This allows us to see the distribution of the strength for a typical sample, highlighting, for example, the lower probability in the scGM of observing low-strength nodes. We find that the distributions of institution 1 are almost as likely to follow a power law as a log-normal distribution, with a slight preference for a log-normal (p-value 0.67 for (a)) and for a power law (p-value 0.90 for (c)). For institution 2 we find statistically significant support for log-normality (p-value 2.2·10⁻⁵ for (b) and 2.4·10⁻⁸ for (d)). Here, the null hypothesis is that the power law performs better than a log-normal distribution.

[Figure: average nearest-neighbour out-strength versus out-strength ('Out-out', panel (a)) and average nearest-neighbour in-strength versus in-strength, comparing the original network with the 'all sectors' and 'no sectors' reconstructions and their CI(0.05, 0.95) bands; both axes logarithmic.]
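One common way to run the power-law versus log-normal comparison described above is the likelihood-ratio test implemented in the powerlaw package (Alstott et al.); the text does not name the tool actually used, so the sketch below is only an assumed, illustrative recipe.

```python
import numpy as np
import powerlaw  # Alstott et al. package; an assumed choice for illustration

def compare_power_law_vs_lognormal(strengths):
    """Likelihood-ratio comparison between a power-law and a log-normal fit of the strengths.

    Returns (R, p): R > 0 favours the power law, R < 0 the log-normal;
    p is the significance of the sign of R.
    """
    s = np.asarray(strengths, dtype=float)
    fit = powerlaw.Fit(s[s > 0])                          # xmin estimated by the package
    return fit.distribution_compare("power_law", "lognormal", normalized_ratio=True)
```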